# Downloading bill data from LegiScan

[LegiScan](https://legiscan.com/) is a legislative tracking service. From their about page:
    
> LegiScan launched to support the release of the national LegiScan data service, providing the nation's first impartial real-time legislative tracking service designed for both public citizens and government affairs professionals across all sectors in organizations large and small. Utilizing the LegiScan API, having nearly 20 years of development maturity, allows us to provide monitoring of every bill in the 50 states and Congress. Giving our users and clients a central and uniform interface with the ability to easily track a wide array of legislative information. Paired with one of the country's most powerful national full bill text legislative search engines.

We're going to use their API to **download data on over a million different pieces of legislation in the US.**


## Imports

In [10]:
import zipfile
import base64
import io
import glob
import time
import json
import os
import requests
import mimetypes

## pylegiscan

To talk to LegiScan's API, we're borrowing some code from [pylegiscan](https://github.com/poliquin/pylegiscan). Since it isn't a package you can install with `pip`, it wound up being easier for distribution to just cut and paste it here.

In [11]:
# Taken from https://github.com/poliquin/pylegiscan/blob/master/pylegiscan/legiscan.py

import os
import json
import requests
from urllib.parse import urlencode
from urllib.parse import quote_plus

# current aggregate status of bill
BILL_STATUS = {1: "Introduced",
               2: "Engrossed",
               3: "Enrolled",
               4: "Passed",
               5: "Vetoed",
               6: "Failed/Dead"}

# significant steps in bill progress.
BILL_PROGRESS = {1: "Introduced",
                 2: "Engrossed",
                 3: "Enrolled",
                 4: "Passed",
                 5: "Vetoed",
                 6: "Failed/Dead",
                 7: "Veto Override",
                 8: "Chapter/Act/Statute",
                 9: "Committee Referral",
                10: "Committee Report Pass",
                11: "Committee Report DNP"}


"""
Interact with LegiScan API.

"""

# a helpful list of valid legiscan state abbreviations (no Puerto Rico)
STATES = ['ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga',
          'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma', 'md', 'me',
          'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne', 'nh', 'nj', 'nm',
          'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri', 'sc', 'sd', 'tn', 'tx',
          'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy']

class LegiScanError(Exception):
    pass

class LegiScan(object):
    BASE_URL = 'http://api.legiscan.com/?key={0}&op={1}&{2}'

    def __init__(self, apikey=None):
        """LegiScan API.  State parameters should always be passed as
           USPS abbreviations.  Bill numbers and abbreviations are case
           insensitive.  Register for API at http://legiscan.com/legiscan
        """
        # see if API key available as environment variable
        if apikey is None:
            apikey = os.environ['LEGISCAN_API_KEY']
        self.key = apikey.strip()

    def _url(self, operation, params=None):
        """Build a URL for querying the API."""
        if not isinstance(params, str) and params is not None:
            params = urlencode(params)
        elif params is None:
            params = ''
        return self.BASE_URL.format(self.key, operation, params)

    def _get(self, url):
        """Get and parse JSON from API for a url."""
        req = requests.get(url)
        if not req.ok:
            raise LegiScanError('Request returned {0}: {1}'\
                    .format(req.status_code, url))
        data = json.loads(req.content)
        if data['status'] == "ERROR":
            raise LegiScanError(data['alert']['message'])
        return data

    def get_session_list(self, state):
        """Get list of available sessions for a state."""
        url = self._url('getSessionList', {'state': state})
        data = self._get(url)
        return data['sessions']

    def get_dataset_list(self, state=None, year=None):
        """Get list of available datasets, with optional state and year filtering.
        """
        # build the filters so state and year can be combined
        params = {}
        if state is not None:
            params['state'] = state
        if year is not None:
            params['year'] = year
        url = self._url('getDatasetList', params)
        data = self._get(url)
        # return a list of dataset descriptions
        return data['datasetlist']

    def get_dataset(self, id, access_key):
        """Get a single dataset archive, identified by its session id and
           the access key from the dataset list.  The archive itself is a
           base64 encoded zip file.
        """
        url = self._url('getDataset', {'id': id, 'access_key': access_key})
        data = self._get(url)
        # return the dataset details, including the encoded zip
        return data['dataset']
      
    def get_master_list(self, state=None, session_id=None):
        """Get list of bills for the current session in a state or for
           a given session identifier.
        """
        if state is not None:
            url = self._url('getMasterList', {'state': state})
        elif session_id is not None:
            url = self._url('getMasterList', {'id': session_id})
        else:
            raise ValueError('Must specify session identifier or state.')
        data = self._get(url)
        # return a list of the bills
        return [data['masterlist'][i] for i in data['masterlist']]

    def get_bill(self, bill_id=None, state=None, bill_number=None):
        """Get primary bill detail information including sponsors, committee
           references, full history, bill text, and roll call information.

           This function expects either a bill identifier or a state and bill
           number combination.  The bill identifier is preferred, and required
           for fetching bills from prior sessions.
        """
        if bill_id is not None:
            url = self._url('getBill', {'id': bill_id})
        elif state is not None and bill_number is not None:
            url = self._url('getBill', {'state': state, 'bill': bill_number})
        else:
            raise ValueError('Must specify bill_id or state and bill_number.')
        return self._get(url)['bill']

    def get_bill_text(self, doc_id):
        """Get bill text, including date, draft revision information, and
           MIME type.  Bill text is base64 encoded to allow for PDF and Word
           data transfers.
        """
        url = self._url('getBillText', {'id': doc_id})
        return self._get(url)['text']

    def get_amendment(self, amendment_id):
        """Get amendment text including date, adoption status, MIME type, and
           title/description information.  The amendment text is base64 encoded
           to allow for PDF and Word data transfer.
        """
        url = self._url('getAmendment', {'id': amendment_id})
        return self._get(url)['amendment']

    def get_supplement(self, supplement_id):
        """Get supplement text including type of supplement, date, MIME type
           and text/description information.  Supplement text is base64 encoded
           to allow for PDF and Word data transfer.
        """
        url = self._url('getSupplement', {'id': supplement_id})
        return self._get(url)['supplement']

    def get_roll_call(self, roll_call_id):
        """Roll call detail for individual votes and summary information."""
        data = self._get(self._url('getRollcall', {'id': roll_call_id}))
        return data['roll_call']

    def get_sponsor(self, people_id):
        """Sponsor information including name, role, and a followthemoney.org
           person identifier.
        """
        url = self._url('getSponsor', {'id': people_id})
        return self._get(url)['person']

    def search(self, state, bill_number=None, query=None, year=2, page=1):
        """Get a page of results for a search against the LegiScan full text
           engine; returns a paginated result set.

           Specify a bill number or a query string.  Year can be an exact year
           or a number between 1 and 4, inclusive.  These integers have the
           following meanings:
               1 = all years
               2 = current year, the default
               3 = recent years
               4 = prior years
           Page is the result set page number to return.
        """
        if bill_number is not None:
            params = {'state': state, 'bill': bill_number}
        elif query is not None:
            params = {'state': state, 'query': query,
                      'year': year, 'page': page}
        else:
            raise ValueError('Must specify bill_number or query')
        data = self._get(self._url('search', params))['searchresult']
        # return a summary of the search and the results as a dictionary
        summary = data.pop('summary')
        results = {'summary': summary, 'results': [data[i] for i in data]}
        return results

    def __str__(self):
        return '<LegiScan API {0}>'.format(self.key)

    def __repr__(self):
        return str(self)
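
To make the `_url` helper above a little more concrete, here is a minimal sketch of the query string it assembles. The `build_url` function and the `DEMO_KEY` value are stand-ins invented for this illustration; only `urlencode` and the URL template come from the code above.

```python
from urllib.parse import urlencode

# Same template as LegiScan.BASE_URL above
BASE_URL = 'http://api.legiscan.com/?key={0}&op={1}&{2}'

def build_url(key, operation, params=None):
    """Hypothetical standalone version of LegiScan._url."""
    if params is None:
        query = ''
    elif isinstance(params, str):
        query = params  # already an encoded query string
    else:
        query = urlencode(params)
    return BASE_URL.format(key, operation, query)

# DEMO_KEY is a placeholder, not a real API key
url = build_url('DEMO_KEY', 'search', {'state': 'tx', 'query': 'abortion'})
print(url)
```

Every method on the class follows the same pattern: build a URL for one operation, fetch it, and pull one key out of the JSON response.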

# Connect to LegiScan

Using pylegiscan, you just pass your API key to `LegiScan` and you're good to go. I set up an environment variable for mine, but you can also just paste yours at `OR_PUT_YOUR_API_KEY_HERE`.

In [7]:
api_key = os.environ.get('LEGISCAN_API_KEY', 'OR_PUT_YOUR_API_KEY_HERE')
legis = LegiScan(api_key)

If you wanted to search for bills based on state or text, that's easy to do.

In [8]:
bills = legis.search(state='tx', query='abortion')
bills['summary'] # how many results did we get?

{'page': '1 of 2',
 'range': '1 - 50',
 'relevancy': '100% - 87%',
 'count': 59,
 'page_current': '1',
 'page_total': 2,
 'query': '(Zabort:(pos=1))'}
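
The summary says this is page 1 of 2, so a single call only returns the first 50 results. A rough sketch of walking every page follows. `search_all_pages` and `FakeClient` are names made up for this example, and it assumes the summary always carries a `page_total` like the one shown above.

```python
def search_all_pages(client, state, query, year=2):
    """Collect results from every page of a search.

    `client` is assumed to behave like the LegiScan class above: search()
    returns {'summary': ..., 'results': [...]} and the summary has a
    'page_total' count.
    """
    results = []
    page = 1
    while True:
        response = client.search(state=state, query=query, year=year, page=page)
        results.extend(response['results'])
        if page >= int(response['summary']['page_total']):
            break
        page += 1
    return results


class FakeClient:
    """Stand-in for LegiScan so the sketch runs without an API key."""
    pages = {1: ['bill-a', 'bill-b'], 2: ['bill-c']}

    def search(self, state, query, year=2, page=1):
        return {'summary': {'page_total': len(self.pages)},
                'results': self.pages[page]}


all_results = search_all_pages(FakeClient(), state='tx', query='abortion')
print(len(all_results))
```

With the real client this would be `search_all_pages(legis, 'tx', 'abortion')`. Each page is a separate API request.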

You can also get single bills, one at a time, as long as you know their ID in the LegiScan database.

In [9]:
legis.get_bill('1256258')

{'bill_id': 1256258,
 'change_hash': 'c5fc7aa0673f84a38a93d77abf9a29d8',
 'session_id': 1650,
 'session': {'session_id': 1650,
  'session_name': '123rd General Assembly',
  'session_title': '123rd General Assembly',
  'year_start': 2019,
  'year_end': 2020,
  'special': 0},
 'url': 'https://legiscan.com/SC/bill/H4523/2019',
 'state_link': 'https://www.scstatehouse.gov/billsearch.php?billnumbers=4523&session=123&summary=B',
 'completed': 1,
 'status': 4,
 'status_date': '2019-05-02',
 'progress': [{'date': '2019-05-02', 'event': 1},
  {'date': '2019-05-02', 'event': 4}],
 'state': 'SC',
 'state_id': 40,
 'bill_number': 'H4523',
 'bill_type': 'R',
 'bill_type_id': '2',
 'body': 'H',
 'body_id': 85,
 'current_body': 'H',
 'current_body_id': 85,
 'title': 'Debi Chard',
 'description': 'Honor Debi Chard On The Occasion Of Her Retirement From Wcsc Live 5 News In Charleston, South Carolina, After Forty-three Years Of Dedicated Service And To Wish Her Many Happy Years In A Well-deserved Retire
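
The `status` and `progress` values in that response are numeric codes. A small sketch of translating them with the lookup tables from the pylegiscan code is below. `describe_bill` is a helper invented for this example, and `sample` is a trimmed copy of the response above.

```python
# Copied from the BILL_STATUS and BILL_PROGRESS tables defined earlier
BILL_STATUS = {1: "Introduced", 2: "Engrossed", 3: "Enrolled",
               4: "Passed", 5: "Vetoed", 6: "Failed/Dead"}
BILL_PROGRESS = {1: "Introduced", 2: "Engrossed", 3: "Enrolled",
                 4: "Passed", 5: "Vetoed", 6: "Failed/Dead",
                 7: "Veto Override", 8: "Chapter/Act/Statute",
                 9: "Committee Referral", 10: "Committee Report Pass",
                 11: "Committee Report DNP"}

def describe_bill(bill):
    """Return the status label and a list of (date, step label) pairs."""
    status = BILL_STATUS.get(bill['status'], 'Unknown')
    steps = [(step['date'], BILL_PROGRESS.get(step['event'], 'Unknown'))
             for step in bill['progress']]
    return status, steps

# Only the fields we need from the bill shown above
sample = {'status': 4,
          'progress': [{'date': '2019-05-02', 'event': 1},
                       {'date': '2019-05-02', 'event': 4}]}

status, steps = describe_bill(sample)
print(status)
print(steps)
```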

# LegiScan Datasets

It'd take forever to download the bills one at a time, so we take advantage of LegiScan's [datasets](https://legiscan.com/datasets) capability. Each dataset is a complete set of bill data for one session of a legislature.

In [11]:
datasets = legis.get_dataset_list()
dataset = legis.get_dataset(datasets[20]['session_id'], datasets[20]['access_key'])
dataset.keys()

dict_keys(['state_id', 'session_id', 'session_name', 'dataset_hash', 'dataset_date', 'dataset_size', 'mime_type', 'zip'])

They come in a _really_ weird format, though: a [base64-encoded](https://en.wikipedia.org/wiki/Base64) zip file. So first we need to convert the base64 zip file into a normal file, then unzip it!

In [12]:
z_bytes = base64.b64decode(dataset['zip'])
z = zipfile.ZipFile(io.BytesIO(z_bytes))
z.extractall("./sample-data")
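
If the base64 step feels mysterious, this self-contained sketch does the whole round trip without touching the API. It builds a tiny zip in memory, encodes it the way the `zip` field arrives, then decodes it again. The file name inside the archive is made up for the example.

```python
import base64
import io
import zipfile

# Build a tiny zip in memory and base64-encode it, mimicking the 'zip' field
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('AK/demo_session/bill/HB1.json', '{"bill": {"bill_id": 1}}')
encoded = base64.b64encode(buffer.getvalue()).decode('ascii')

# Decode it the same way as above, but list the contents instead of extracting
decoded = zipfile.ZipFile(io.BytesIO(base64.b64decode(encoded)))
names = decoded.namelist()
print(names)
```

Calling `namelist()` first is a cheap way to peek at what an archive contains before extracting it to disk.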

It creates a lot lot lot lot lot of `.json` files. For example, let's take a look at a sample of what we just extracted.

In [1]:
import glob

filenames = glob.glob("./sample-data/*/*/bill/*")
filenames[:15]

['./sample-data/AK/2017-2018_30th_Legislature/bill/SCR10.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/SB124.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB65.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB392.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB238.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HCR25.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB111.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HJR2.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HCR1.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB404.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB280.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/SB173.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/SCR9.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HCR401.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB32.json']

Each file has all sorts of information about the bill, but **none of the text of the bill itself!** You can see for yourself:

In [35]:
import json

with open("./sample-data/AK/2017-2018_30th_Legislature/bill/SCR10.json") as file:
    json_data = json.load(file)
json_data

{'bill': {'bill_id': 1004624,
  'change_hash': '557d10e3e229284c17c4354e988bad06',
  'session_id': 1397,
  'session': {'session_id': 1397,
   'session_name': '30th Legislature',
   'session_title': '30th Legislature',
   'year_start': 2017,
   'year_end': 2018,
   'special': 0},
  'url': 'https://legiscan.com/AK/bill/SCR10/2017',
  'state_link': 'http://www.akleg.gov/basis/Bill/Detail/30?Root=SCR10',
  'completed': 0,
  'status': 3,
  'status_date': '2018-04-28',
  'progress': [{'date': '2017-04-07', 'event': 1},
   {'date': '2018-02-02', 'event': 10},
   {'date': '2018-02-09', 'event': 2},
   {'date': '2018-03-09', 'event': 10},
   {'date': '2018-04-28', 'event': 3}],
  'state': 'AK',
  'state_id': 2,
  'bill_number': 'SCR10',
  'bill_type': 'CR',
  'bill_type_id': '3',
  'body': 'S',
  'body_id': 14,
  'current_body': 'H',
  'current_body_id': 13,
  'title': 'Alaska Year Of Innovation',
  'description': 'Proclaiming 2019 to be the Year of Innovation in Alaska.',
  'committee': [],
  

You _can_ download the bill text if you have the ID, but... for some reason we don't do this. I'm going to be honest: I don't remember why. Maybe it's because they're older versions? Maybe they're incomplete? I truly have forgotten.

In [8]:
doc = legis.get_bill_text('2015157')
contents = base64.b64decode(doc['doc'])
with open("filename.html", "wb") as file:
    file.write(contents)
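
Bill texts are not always HTML, so hard-coding `filename.html` can produce a mislabeled file. Here is a hedged sketch that picks the extension from the MIME type instead. It assumes the response carries a `mime` field next to `doc`, as the `get_bill_text` docstring suggests. `save_bill_text` and `fake_doc` are invented for this example.

```python
import base64
import mimetypes
import os
import tempfile

def save_bill_text(doc, basename):
    """Write a decoded bill text to disk, choosing the extension from
    its MIME type. Falls back to '.bin' when the type is unknown.
    """
    extension = mimetypes.guess_extension(doc.get('mime', '')) or '.bin'
    filename = basename + extension
    with open(filename, 'wb') as file:
        file.write(base64.b64decode(doc['doc']))
    return filename

# A fake response so the sketch runs without an API key
fake_doc = {'mime': 'application/pdf',
            'doc': base64.b64encode(b'%PDF-1.4 demo').decode('ascii')}

saved = save_bill_text(fake_doc, os.path.join(tempfile.mkdtemp(), 'bill-text'))
print(os.path.basename(saved))
```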

What we're going to need is the **URL to the published version.**

In [42]:
json_data['bill']['texts'][-1]

{'doc_id': 1790359,
 'date': '2018-05-01',
 'type': 'Enrolled',
 'type_id': 5,
 'mime': 'application/pdf',
 'mime_id': 2,
 'url': 'https://legiscan.com/AK/text/SCR10/id/1790359',
 'state_link': 'http://www.legis.state.ak.us/PDF/30/Bills/SCR010Z.PDF',
 'text_size': 592822}

We're going to need the URL to the published version from _every single one of those JSON files_.
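
As a small sketch of that lookup, the helper below returns the `state_link` of the last entry in `texts`, or `None` when a bill has no texts at all. `published_url` is a name invented here, and the sample bills are trimmed-down stand-ins for the real JSON.

```python
def published_url(bill):
    """Return the state_link of the last listed text, or None if missing."""
    texts = bill.get('texts') or []
    if not texts:
        return None
    return texts[-1].get('state_link')

# Trimmed-down examples: one bill with a text, one without
with_text = {'texts': [{'doc_id': 1790359,
                        'type': 'Enrolled',
                        'state_link': 'http://www.legis.state.ak.us/PDF/30/Bills/SCR010Z.PDF'}]}
without_text = {'texts': []}

print(published_url(with_text))
print(published_url(without_text))
```

Returning `None` instead of raising keeps a bulk run over every file from stopping at the first bill that has no published text.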

# Download and extract all of the datasets from LegiScan

In [46]:
datasets = legis.get_dataset_list()
len(datasets)

583

Downloading and extracting all 583 is going to take a while, so we'll use a progress bar from [tqdm](https://github.com/tqdm/tqdm) to keep track of where we're at.

In [59]:
from tqdm.notebook import tqdm

for dataset in tqdm(datasets):
    session_id = dataset['session_id']
    access_key = dataset['access_key']
    details = legis.get_dataset(session_id, access_key)
    z_bytes = base64.b64decode(details['zip'])
    z = zipfile.ZipFile(io.BytesIO(z_bytes))
    z.extractall("./bill_data")
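
A run this long can die partway through. One possible way to make it resumable is sketched below: it records each finished `session_id` in a small JSON file and skips those on the next run. `download_datasets` and the `done_file` idea are assumptions for this sketch, not part of pylegiscan, and it will not notice when LegiScan publishes an updated dataset for a session it already extracted.

```python
import base64
import io
import json
import os
import zipfile

def download_datasets(datasets, fetch, target_dir, done_file):
    """Extract each dataset once, remembering finished session_ids.

    `fetch(session_id, access_key)` is assumed to return a dict with a
    base64 'zip' field, like legis.get_dataset above.
    """
    done = set()
    if os.path.exists(done_file):
        with open(done_file) as file:
            done = set(json.load(file))

    for dataset in datasets:
        session_id = dataset['session_id']
        if session_id in done:
            continue  # already extracted on an earlier run
        details = fetch(session_id, dataset['access_key'])
        archive = zipfile.ZipFile(io.BytesIO(base64.b64decode(details['zip'])))
        archive.extractall(target_dir)
        done.add(session_id)
        # Save progress after every dataset so a crash loses at most one
        with open(done_file, 'w') as file:
            json.dump(sorted(done), file)
    return done
```

With the real client, the call would look like `download_datasets(datasets, legis.get_dataset, "./bill_data", "./bill_data-done.json")`.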


# Converting the many JSON files to a single CSV file

The data isn't doing us much good sitting around as a zillion JSON files, so we'll convert them into a CSV file with the pieces of information we're interested in. The key pieces are:

* State
* Bill title
* Bill URL

In [5]:
filenames = glob.glob("bill_data/*/*/bill/*.json")
len(filenames)

1253402

In [6]:
filenames[:5]

['bill_data/VT/2011-2012_Regular_Session/bill/HCR143.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/H0291.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/S0162.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/S0027.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/H0784.json']

If we want to process over a million rows, it's going to take a while! To speed things up we're going to turn to [swifter](https://github.com/jmcarpenter2/swifter), a package that can parallelize work on pandas dataframes. It's pretty easy to use:

**without swifter:**

```python
df = pd.Series(filenames).apply(process_json)
```

**with swifter:**

```python
df = pd.Series(filenames).swifter.apply(process_json)
```

And it does all the hard work for you! You just use it and hope for the best.

In [9]:
import json
import os
import swifter
import pandas as pd

def process_json(filename):
    with open(filename) as file:
        bill_data = {}
        # We need to do a little string replacing so the placeholder
        # "0000-00-00" dates become null
        json_str = file.read().replace('"0000-00-00"', 'null')
        content = json.loads(json_str)['bill']

        bill_data['bill_id'] = content['bill_id']
        bill_data['code'] = os.path.splitext(os.path.basename(filename))[0]
        bill_data['bill_number'] = content['bill_number']
        bill_data['title'] = content['title']
        bill_data['description'] = content['description']
        bill_data['state'] = content['state']
        bill_data['session'] = content['session']['session_name']
        bill_data['filename'] = filename
        bill_data['status'] = content['status']
        bill_data['status_date'] = content['status_date']

        try:
            bill_data['url'] = content['texts'][-1]['state_link']
        except (KeyError, IndexError):
            # some bills have no texts listed, so leave the URL empty
            pass

        return pd.Series(bill_data)

df = pd.Series(filenames).swifter.apply(process_json)
df.head()


| | bill_id | code | bill_number | title | description | state | session | filename | status | status_date | url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 325258 | HCR143 | HCR143 | House Concurrent Resolution Congratulating The... | House Concurrent Resolution Congratulating The... | VT | 2011-2012 Session | bill_data/VT/2011-2012_Regular_Session/bill/HC... | 4 | 2011-04-22 | http://www.leg.state.vt.us/docs/2012/Acts/ACTR... |
| 1 | 285625 | H0291 | H0291 | An Act Relating To Raising The Penalties For A... | An Act Relating To Raising The Penalties For A... | VT | 2011-2012 Session | bill_data/VT/2011-2012_Regular_Session/bill/H0... | 1 | 2011-02-22 | http://www.leg.state.vt.us/docs/2012/bills/Int... |
| 2 | 398232 | S0162 | S0162 | An Act Relating To Powers Of Attorney | An Act Relating To Powers Of Attorney | VT | 2011-2012 Session | bill_data/VT/2011-2012_Regular_Session/bill/S0... | 1 | 2012-01-03 | http://www.leg.state.vt.us/docs/2012/bills/Int... |
| 3 | 243054 | S0027 | S0027 | An Act Relating To The Role Of Municipalities ... | An Act Relating To The Role Of Municipalities ... | VT | 2011-2012 Session | bill_data/VT/2011-2012_Regular_Session/bill/S0... | 1 | 2011-01-25 | http://www.leg.state.vt.us/docs/2012/bills/Int... |
| 4 | 417691 | H0784 | H0784 | An Act Relating To Approval Of The Adoption An... | An Act Relating To Approval Of The Adoption An... | VT | 2011-2012 Session | bill_data/VT/2011-2012_Regular_Session/bill/H0... | 4 | 2012-05-05 | http://www.leg.state.vt.us/docs/2012/Acts/ACTM... |


And now we'll save it to prepare for the next step: **inserting it into a database.**

In [12]:
df.to_csv("data/bills-with-urls.csv", index=False)
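
Since the URL is the piece the next step depends on, it can be worth checking how many rows ended up without one. The sketch below uses only the standard library and assumes the CSV has a `url` column like the one written above. `count_missing_urls` is a name invented for this example.

```python
import csv

def count_missing_urls(csv_path):
    """Return (rows_without_url, total_rows) for a CSV with a 'url' column."""
    missing = 0
    total = 0
    with open(csv_path, newline='', encoding='utf-8') as file:
        for row in csv.DictReader(file):
            total += 1
            # pandas writes missing values as empty strings by default
            if not (row.get('url') or '').strip():
                missing += 1
    return missing, total
```

Calling `count_missing_urls("data/bills-with-urls.csv")` would give both counts for the file we just saved.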